Titanic survival analysis

The Titanic survivors dataset is widely used to illustrate concepts of data cleaning and exploration.

Let's start by importing the data to a pandas DataFrame from a CSV file:


In [1]:
import pandas as pd

In [2]:
raw_data = pd.read_csv('datasets/titanic.csv')
raw_data.head()


Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [3]:
raw_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The information above shows that this dataset contains data for 891 passengers: their names, gender, age, etc. (for a complete description of the meaning of each column, check this link).

Missing values

Before starting the data analysis, we need to check the data's "health" by looking at how much information is actually present in each column.


In [4]:
# Percentage of missing values in each column
(raw_data.isnull().sum() / len(raw_data)) * 100.0


Out[4]:
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

It can be seen that 77% of the passengers have no information about which cabin they were allocated to. This information could be useful for further analysis but, for now, let's drop this column:


In [5]:
raw_data.drop('Cabin', axis='columns', inplace=True)
raw_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB

The Embarked column, which indicates the port where the passenger embarked, only has a few missing entries. Since the number of passengers with missing values is negligible, they can be discarded without much harm:


In [6]:
raw_data.dropna(subset=['Embarked'], inplace=True)
(raw_data.isnull().sum() / len(raw_data)) * 100.0


Out[6]:
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.910011
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Embarked        0.000000
dtype: float64

Finally, age is missing for around 20% of the passengers. It's not reasonable to drop all these passengers, nor to drop the column as a whole, so one possible solution is to fill the missing values with the median age of the dataset:


In [7]:
raw_data.fillna({'Age': raw_data.Age.median()}, inplace=True)
(raw_data.isnull().sum() / len(raw_data)) * 100.0


Out[7]:
PassengerId    0.0
Survived       0.0
Pclass         0.0
Name           0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Ticket         0.0
Fare           0.0
Embarked       0.0
dtype: float64

Why use the median instead of the average?

The median is a robust statistic. A statistic is a number that summarizes a set of values, and a statistic is said to be robust if it is not significantly affected by extreme values in the data.

Suppose we have a group of people whose ages are [15, 16, 14, 15, 15, 19, 14, 17]. The average age in this group is 15.625. If an 80-year-old person joins the group, its average age becomes about 22.78 years, which no longer represents the group's age profile well. The median age, on the other hand, is 15 years in both cases - i.e. the median was not changed by the presence of an outlier in the data, which makes it a robust statistic for the ages of the group.
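The example above can be reproduced with Python's standard library:

```python
import statistics

# Ages from the example above
ages = [15, 16, 14, 15, 15, 19, 14, 17]
print(statistics.mean(ages))    # 15.625
print(statistics.median(ages))  # 15.0

ages.append(80)                 # an outlier joins the group
print(statistics.mean(ages))    # ~22.78 -- pulled up by the outlier
print(statistics.median(ages))  # 15 -- unchanged
```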

Now that all of the passengers' information has been "cleaned", we can start to analyse the data.

Exploratory analysis

Let's start by exploring how many people in this dataset survived the Titanic:


In [8]:
import matplotlib.pyplot as plt
%matplotlib inline

In [9]:
overall_fig = raw_data.Survived.value_counts().plot(kind='bar')
overall_fig.set_xlabel('Survived')
overall_fig.set_ylabel('Amount')


Out[9]:
<matplotlib.text.Text at 0x119d124e0>

Overall, 38% of the passengers survived.
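That percentage can be computed directly with `value_counts(normalize=True)`. A minimal sketch on a stand-in Series (on the real data, `raw_data.Survived.value_counts(normalize=True)` yields roughly 0.62 / 0.38):

```python
import pandas as pd

# Stand-in for raw_data.Survived, just to illustrate the call
survived = pd.Series([0, 0, 0, 1, 1])

# normalize=True returns proportions instead of raw counts
print(survived.value_counts(normalize=True))
```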

Now, let's segment the proportion of survivors along different profiles (the code to generate the following graphs was taken from this link).

By gender


In [10]:
survived_sex = raw_data[raw_data['Survived']==1]['Sex'].value_counts()
dead_sex = raw_data[raw_data['Survived']==0]['Sex'].value_counts()
df = pd.DataFrame([survived_sex,dead_sex])
df.index = ['Survivors','Non-survivors']
df.plot(kind='bar',stacked=True, figsize=(15,8));


By age


In [11]:
figure = plt.figure(figsize=(15,8))
plt.hist([raw_data[raw_data['Survived']==1]['Age'], raw_data[raw_data['Survived']==0]['Age']], 
         stacked=True, color=['g','r'],
         bins=30, label=['Survivors','Non-survivors'])
plt.xlabel('Age')
plt.ylabel('No. passengers')
plt.legend();


By fare


In [12]:
figure = plt.figure(figsize=(15,8))
plt.hist([raw_data[raw_data['Survived']==1]['Fare'], raw_data[raw_data['Survived']==0]['Fare']], 
         stacked=True, color=['g','r'],
         bins=50, label=['Survivors','Non-survivors'])
plt.xlabel('Fare')
plt.ylabel('No. passengers')
plt.legend();


The graphs above indicate that passengers who are female, younger than 20, and/or paid higher fares had a greater chance of surviving the Titanic (what a surprise!). How can we use this information to predict whether a passenger would survive the accident?

Predicting chances of surviving

Let's start by keeping only the information that we wish to use - we'll keep the passenger names for further analysis:


In [13]:
data_for_prediction = raw_data[['Name', 'Sex', 'Age', 'Fare', 'Survived']].copy()
data_for_prediction.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 5 columns):
Name        889 non-null object
Sex         889 non-null object
Age         889 non-null float64
Fare        889 non-null float64
Survived    889 non-null int64
dtypes: float64(2), int64(1), object(2)
memory usage: 41.7+ KB

Numeric encoding of Strings

Some information is encoded as strings: the passenger's gender, for instance, is represented by the strings male and female. To make use of this information in our upcoming predictive model, we must convert them to numeric values:


In [14]:
data_for_prediction['Sex'] = data_for_prediction.Sex.map({'male': 0, 'female': 1})
data_for_prediction.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 5 columns):
Name        889 non-null object
Sex         889 non-null int64
Age         889 non-null float64
Fare        889 non-null float64
Survived    889 non-null int64
dtypes: float64(2), int64(2), object(1)
memory usage: 41.7+ KB

Training/validation set split

In order to assess the model's predictive power, part of the data (in this case, 25%) must be set aside as a validation set.

A validation set is a dataset for which the expected values are known but which is not used to train the predictive model - this way, the model is not biased by information from these entries, and the set can be used to estimate the model's error rate.


In [15]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data_for_prediction, test_size=0.25, random_state=254)
len(train_data), len(test_data)


Out[15]:
(666, 223)

Predicting survival chances with decision trees

We'll use a simple Decision Tree model to predict whether a passenger would survive the Titanic based on their gender, age, and fare.


In [16]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier().fit(train_data[['Sex', 'Age', 'Fare']], train_data.Survived)
tree.score(test_data[['Sex', 'Age', 'Fare']], test_data.Survived)


Out[16]:
0.80269058295964124

With a simple decision tree, the result above indicates that it's possible to correctly predict the survival of about 80% of the passengers.
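A single train/test split gives a somewhat noisy accuracy estimate; a common refinement is k-fold cross-validation, which averages the score over several splits. Below is a minimal sketch using scikit-learn's `cross_val_score` - the DataFrame here is synthetic, made up so the snippet runs on its own; on the real data you would pass `data_for_prediction[['Sex', 'Age', 'Fare']]` and `data_for_prediction.Survived` instead:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (made up for illustration)
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame({
    'Sex': rng.integers(0, 2, n),
    'Age': rng.uniform(1, 70, n),
    'Fare': rng.uniform(5, 100, n),
})
# Label loosely mimicking the pattern seen above: women and
# high-fare passengers survive more often
demo['Survived'] = ((demo.Sex == 1) | (demo.Fare > 60)).astype(int)

# Accuracy averaged over 5 folds instead of a single split
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         demo[['Sex', 'Age', 'Fare']], demo.Survived, cv=5)
print(scores.mean())
```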

An interesting exercise after training a predictive model is to take a look at the cases where it missed:


In [17]:
test_data = test_data.copy()
test_data['Predicted'] = tree.predict(test_data[['Sex', 'Age', 'Fare']])
test_data[test_data.Predicted != test_data.Survived]


Out[17]:
Name Sex Age Fare Survived Predicted
207 Albimona, Mr. Nassef Cassem 0 26.00 18.7875 1 0
660 Frauenthal, Dr. Henry William 0 50.00 133.6500 1 0
81 Sheerlinck, Mr. Jan Baptist 0 29.00 9.5000 1 0
762 Barah, Mr. Hanna Assi 0 20.00 7.2292 1 0
446 Mellinger, Miss. Madeleine Violet 1 13.00 19.5000 1 0
247 Hamalainen, Mrs. William (Anna) 1 24.00 14.5000 1 0
43 Laroche, Miss. Simonne Marie Anne Andree 1 3.00 41.5792 1 0
137 Futrelle, Mr. Jacques Heath 0 37.00 53.1000 0 1
679 Cardeza, Mr. Thomas Drake Martinez 0 36.00 512.3292 1 0
821 Lulic, Mr. Nikola 0 27.00 8.6625 1 0
508 Olsen, Mr. Henry Margido 0 28.00 22.5250 0 1
357 Funk, Miss. Annie Clemmer 1 38.00 13.0000 0 1
748 Marvin, Mr. Daniel Warner 0 19.00 53.1000 0 1
288 Hosono, Mr. Masabumi 0 42.00 13.0000 1 0
712 Taylor, Mr. Elmer Zebley 0 48.00 52.0000 1 0
238 Pengelly, Mr. Frederick William 0 19.00 10.5000 0 1
804 Hedman, Mr. Oskar Arvid 0 27.00 6.9750 1 0
71 Goodwin, Miss. Lillian Amy 1 16.00 46.9000 0 1
429 Pickard, Mr. Berk (Berk Trembisky) 0 32.00 8.0500 1 0
498 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) 1 25.00 151.5500 0 1
692 Lam, Mr. Ali 0 28.00 56.4958 1 0
147 Ford, Miss. Robina Maggie "Ruby" 1 9.00 34.3750 0 1
245 Minahan, Dr. William Edward 0 44.00 90.0000 0 1
18 Vander Planke, Mrs. Julius (Emelia Maria Vande... 1 31.00 18.0000 0 1
259 Parrish, Mrs. (Lutie Davis) 1 50.00 26.0000 1 0
271 Tornquist, Mr. William Henry 0 25.00 0.0000 1 0
339 Blackwell, Mr. Stephen Weart 0 45.00 35.5000 0 1
314 Hart, Mr. Benjamin 0 43.00 26.2500 0 1
209 Blank, Mr. Henry 0 40.00 31.0000 1 0
440 Hart, Mrs. Benjamin (Esther Ada Bloomfield) 1 45.00 26.2500 1 0
673 Wilhelms, Mr. Charles 0 31.00 13.0000 1 0
678 Goodwin, Mrs. Frederick (Augusta Tyler) 1 43.00 46.9000 0 1
100 Petranec, Miss. Matilda 1 28.00 7.8958 0 1
400 Niskanen, Mr. Juha 0 39.00 7.9250 1 0
17 Williams, Mr. Charles Eugene 0 28.00 13.0000 1 0
449 Peuchen, Major. Arthur Godfrey 0 52.00 30.5000 1 0
312 Lahtinen, Mrs. William (Anna Sylfven) 1 26.00 26.0000 0 1
637 Collyer, Mr. Harvey 0 31.00 26.2500 0 1
796 Leader, Dr. Alice (Farnham) 1 49.00 25.9292 1 0
737 Lesurer, Mr. Gustave J 0 35.00 512.3292 1 0
146 Andersson, Mr. August Edvard ("Wennerstrom") 0 27.00 7.7958 1 0
862 Swift, Mrs. Frederick Joel (Margaret Welles Ba... 1 48.00 25.9292 1 0
78 Caldwell, Master. Alden Gates 0 0.83 29.0000 1 0
473 Jerwan, Mrs. Amin S (Marie Marthe Thuillard) 1 23.00 13.7917 1 0

One example of a wrong prediction above is the case of Mrs. Hudson J C Allison, who didn't survive the Titanic despite being female, 25 years old, and having paid an expensive fare. A search on Encyclopedia Titanica reveals that, after having been put into a lifeboat, she was informed that her son had embarked on another lifeboat on the opposite side of the ship - Mrs. Allison then left her boat in an attempt to reach her son, but to no avail.

A particularly interesting collection of stories related to the Titanic passengers can be found in this post.